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A parallel implementation of an eigensolver designed for electronic structure calculations is pre¬ 
sented. The method is applicable to computational tasks that solve a sequence of eigenvalue prob¬ 
lems where the solution for a particular iteration is similar but not identical to the solution from 
the previous iteration. Such problems occur frequently when performing electronic structure calcu¬ 
lations in which the eigenvectors are solutions to the Kohn-Sham equations. The eigenvectors are 
represented in some type of basis but the problem sizes are normally too large for direct diagonal- 
ization in that basis. Instead a subspace diagonalization procedure is employed in which matrix 
elements of the Hamiltonian operator are generated and the eigenvalues and eigenvectors of the re¬ 
sulting reduced matrix are obtained using a standard eigensolver from a package such as LAPACK 
or SCALAPACK. While this method works well and is widely used, the standard eigensolvers scale 
poorly on massively parallel computer systems for the matrix sizes typical of electronic structure 
calculations. We present a new method that utilizes a partitioned folded spectrum scheme (PFSM) 
that takes into account the iterative nature of the problem and performs well on massively parallel 
systems. Test results for a range of problems are presented that demonstrate an equivalent level 
of accuracy when compared to the standard eigensolvers, while also executing up to an order of 
magnitude faster. Unlike O(N) methods, the technique works equally well for metals and systems 
with unoccupied orbitals as for insulators and semiconductors. Timing and accuracy results are 
presented for a range of systems, including a 512 atom diamond cell, a cluster of 13 C60 molecules, 
bulk copper, a 216 atom silicon cell with a vacancy, using 40 unoccupied states/atom, and a 4000 
atom aluminum supercell. 


I. INTRODUCTION 

Electronic structure calculations based on density 
functional theory are widely used in physics and materi¬ 
als science. These calculations usually involve numerical 
solutions of the Kohn-Sham equations for a set of eigen¬ 
values and eigenvectors. 

H ks [il)i] = + Veffipi = A iifi i = 1,7V (1) 

where 

be// Vhartree T 14 c T ^ionic (2) 

Self consistent pseudopotentialsi -3 of various forms are 
used to represent Vi on i C while Vhartree and Vxc are the ex¬ 
change correlation and hartree potentials. A typical solu¬ 
tion process proceeds by generating an initial set of elec¬ 
tronic orbitals and an associated charge density. A linear 
combination of atomic orbitals from the constituent ions 
is often used for this together with a superposition of 
atomic charge densities as a starting density. The initial 


wavefunctions may be represented by expansions in basis 
functions, such as plane waves^, augmented waves^ or 
grid based discretizations^—. The initial density is then 
used to compute exchange correlation and hartree poten¬ 
tials using functionals of the electron density^, which 
along with the ionic potentials form 14//. This repre¬ 
sentation of 14// is then substituted back into the Kohn- 
Sham equations that are solved for a new set of electronic 
orbitals. This process is repeated until self consistency 
is reached, where self consistency is defined as the dif¬ 
ference between 14//(in) and V e ff(out) potentials being 
less than a specified tolerance. 

A variety of methods have been used to solve Eq. m 
for each self consistent step. Many of these methods in¬ 
clude a subspace diagonalization and solution of a real 
symmetric eigenvalue problem as part of the procedure. 
For large systems this will normally be the most expen¬ 
sive part of the calculations as the work required scales as 
the cube of the number of electronic orbitals. Since this is 
the main computational bottleneck for large DFT calcu¬ 
lations, considerable work has been carried out to either 
eliminate the cubic scaling terms entirely or reduce their 
prefactor. Algorithms that fall into the first category of- 
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ten rely on the restricted spatial range made possible by 
real space methods. These include both grid base d — ’ —, 
and localized basis function approaches.*!- 3 The present 
work, which is a partitioned folded spectrum 14 * method 
(PFSM) falls into the second category and works by dis¬ 
tributing eigenpairs over independent processing nodes. 
Eigenpairs on a given node are solved for using submatrix 
diagonalizations and an iterative folded spectrum tech¬ 
nique is used to couple the submatrix solutions to the 
full matrix. No distinction is made between occupied 
and unoccupied states, in contrast to the spectrum slic¬ 
ing method developed by Schofield et al— where only 
the occupied eigenspectrum is divided into discrete slices. 
These are distributed over the nodes and filtering func¬ 
tions are used to restrict the eigenpairs on a specific slice 
to a discrete range of the occupied orbitals. The current 
method also shares some similarities with the shift and 
invert parallel spectrum transformation (SIPs) used by 
Zhang et al— for tight binding calculations, which gener¬ 
ate sparse matrix eigenproblems as opposed to the dense 
matrices addressed in the current work. Their approach 
takes advantage of the sparsity to reduce the scaling fac¬ 
tor from 0(N 3 ) to 0(N 2 ) for sufficiently sparse matrices. 

Since direct diagonalization of Eq. m is difficult, a sub¬ 
space diagonalization procedure is employed in which 
matrix elements of the Hamiltonian operator are gen¬ 
erated. The eigenvalue problem for the resulting reduced 
matrix is then given by. 

Axi = XiXi i = 1, N (3) 

This can be solved using a variety of standard library 
routines. These include LAPACK and/or MAGMA— 
routines, which utilize a single processing node, and 
ScaLapack, which is designed to use multiple nodes. 
When running on multiple nodes of a supercomputer or 
cluster, LAPACK is a poor choice since it can only utilize 
the processing power of a single node. On a system with 
GPU’s, the Magma libraries are a viable alternative for 
N ranging up to a few thousand. While Magma can cur¬ 
rently only apply the processing power of a single node, 
GPU’s are well suited for this type of problem and are 
so fast that solutions may be achieved in an acceptable 
amount of time. For problems where N is greater than 
a few thousand, ScaLapack usually becomes the best 
choice. The exact size of N depends on the relative bal¬ 
ance of computational power to communications speed 
on the computer system being used. Unfortunately, for 
any given value of N a point will be reached where the 
performance of ScaLapack scales poorly as the number 
of nodes is increased due to communications overhead. 
Current implementations of ScaLapack are also unable 
to effectively utilize GPU’s, leaving large amounts of pro¬ 
cessing power unused. 

The current work is designed to address the shortcom¬ 
ings of these methods. The main design goal is to mini¬ 
mize the wall clock time required for a solution of Eq. ([3]). 
Exceptional efficiency in terms of the total operations 


count is not necessarily needed to achieve this. Instead, 
the focus is on effectively using as much of the processing 
power available, as opposed to the current methods that 
may leave much of that power untapped on large clusters. 

II. METHODOLOGY 

Let {A;} and {ui} be the eigenvalues and eigenvectors 
of A. We order A* in distance from the target e. So 

|Ai - e\ < |A 2 - e| < |A 3 - e| < ....|A„ - e|. (4) 

The eigenvector corresponding to the eigenvalue closest 
to e may be determined with the iteration. 

x n+1 = {I^a(A-eI) 2 )x n (5) 

This is just the power method for the matrix. 

I — a(A — el) 2 (6) 

and it will converge to the desired eigenvalue 

l-a(Ai-e) 2 , (7) 

if the eigenvalues of I — a(A — el) 2 are ordered so that 

for each n 

a|A„ - e| < 1 (8) 

and 


Then 

|1 - a(X n - e) 2 | < |1 - a(X n -i - e) 2 | < ... 

< ...|1 — a(A 2 — e) 2 | < |1 — a(Ar — e) 2 | 

for 1 < i < N — 1. 

If © is satisfied then given an initial estimate of an 
eigenvector x\ and a shift value e, the iterative process 
will produce an eigenvector with an eigenvalue close to e. 

x{ +1 = x\ — a(A — el) 2 xl 0 < a < 1 (11) 

This may be implemented via a sequence of matrix- 
vector products which eliminates the need to explicitly 
square the matrix A — el. Since the iterative process is 
only guaranteed to produce eigenvectors for eigenvalues 
close to the shift value e, we must also have a way of 



3 


generating a suitable set of shifts and initial eigenvectors. 
We do this by solving a sequence of eigenvector problems 
generated from overlapping diagonal submatrices of the 
original matrix A. 

In particular, each diagonal submatrix B is of order M 
where M = 7 IV and 0 < 7 < 1. Since the work required 
to solve each submatrix scales as M 3 , each individual 
submatrix requires 7 3 less work. The total computa¬ 
tional work required then depends on the magnitude of 
7 and the number of submatrices needed. Even when 
this is larger than the work required by a standard solver 
such as LAPACK or ScaLapack, the wall clock time re¬ 
quired for a solution can be significantly less because the 
algorithm is better adapted for massively parallel com¬ 
puter architectures. By assigning individual submatrices 
to smaller blocks of processing nodes, the scalability con¬ 
straints found in the standard solvers are alleviated. In 
our tests we have found for values of N ranging up to a few 
thousand (common in electronic structure calculations), 
assigning one submatrix per processing node works well 
on machines with GPU accelerators. For this case using 
a computing platform with P processing nodes, a slice of 
eigenvectors of size m = N/P may be assigned to each 
node as shown in Fig. ©■ 



FIG. 1. The region with coarse grained cross hatching rep¬ 
resents the set of eigenvectors assigned to an individual pro¬ 
cessing node. The region with fine grained cross hatching 
represents both submatrix B and the set of eigenvectors as¬ 
sociated with a solution of Byi = Xyi 

The eigenvector solutions yt for the submatrices are 
used to construct the starting vectors x 3 for Eq. m and 
the corresponding eigenvalues are used as the shifts e. 
Since M < N, the elements of x t that do not have a 
corresponding component in the submatrix solution are 
set to zero. The iterative process from Eq. HID is then 
applied and after all eigenvectors have been processed, 
Gram-Schmidt orthogonalization is used to ensure or¬ 


thonormality of the resulting eigenvectors. Note that 
since this process is applied on a subspace and is readily 
parallelized, it represents only a moderate fraction of the 
total work. 

This method will not work reliably for an arbitrary ma¬ 
trix. This may be demonstrated by taking a set of eigen¬ 
vectors that satisfy Eq. © and randomly mixing them to 
generate A. In this case the eigenvectors of the subma¬ 
trices will consist of linear combinations of eigenvectors 
from the full spectrum and the iterative process described 
in Eq. m can produce nearly identical estimates for the 
submatrix that will not span the full spectrum. 

A key condition for success then is that A be diag¬ 
onally dominant. We have generally found that using 
standard diagonalization for the first 3 to 5 SCF steps is 
sufficient when the submatrix widths are 30 percent of 
the full width of A and the number of eigenvectors com¬ 
puted via Eq. (11) from each submatrix is less than 10 
percent of the full width. We thus expect major savings 
during the usual applications of DFT, such as geometry 
optimization and ab initio molecular dynamics, in which 
only the first ionic step would require some standard di- 
agonalizations, while PFSM would be used for the many 
subsequent steps. 


III. ACCURACY AND CONVERGENCE 

A variety of tests were performed to determine both 
accuracy and the effect on convergence rates of the new 
method. The RMG^ code developed at NCSU was used 
for all tests. The first system consists of a 512-atom dia¬ 
mond cell with a Vanderbilt^ ultrasoft pseudopotential 
used to represent the ionic cores. Test runs using sub¬ 
matrix widths ranging from 0.1 to 0.3 show that a width 
of 0.3 produces results identical to those obtained using 
standard full diagonalization. While the size of this sys¬ 
tem is typical of production SCF calculations, it is a per¬ 
fect crystal with highly degenerate eigenvalues. Many 
systems of interest are disordered and we expect that 
they should actually be easier to handle computationally 
because the degeneracies will normally be broken in such 
systems. 

As an example of a disordered system we selected a 
cluster of 13 C60 molecules in a non-equilibrium FCC 
structure. This system again exhibits identical results be¬ 
tween standard full diagonalization and the new method 
once the submatrix width reachs 0.3. 

Since diamond and c60 both have bandgaps, bulk cop¬ 
per is selected as an example of a metallic system. Or¬ 
bital occupations near the Fermi level are assigned using 
a Fermi-Dirac distribution with /i = 0.04 eV. In this case 
the overall convergence rate was slower, as expected for a 
metal, but the submatrix width required in order for the 
PFSM results to match those from full diagonalization 
is reduced from 0.3 to 0.2. A similar reduction of the 
required submatrix width has also been observed with 
other metallic systems, which we attribute to the break- 











FIG. 4. Convergence rates for a 256 atom copper cell for 
various submatrix widths. 



FIG. 3. Convergence rates for a cluster of 13 C60 molecules 
(780 atoms total) for various submatrix widths. 


viding up to 9.6 GBytes/sec of bidirectional bandwidth. 
The current version of PFSM used in the RMG code is 
targeted towards systems where N < 10000. This limit 
is not specific to the algorithm, but rather is a conse¬ 
quence of the current implementation that assigns each 
submatrix to a single processing node. As this problem 
size is quite common in electronic structure calculations, 
the method is widely applicable. 

Several test systems were selected to illustrate the po¬ 
tential performance gains. The timings depend on the 
submatrix width, which is always selected so that conver¬ 
gence rates and final results are essentially identical for 
PFSM and full diagonalization. Results for the 256 atom 
copper cell that were used in the convergence tests illus¬ 
trated in Fig.® are presented in Table Q] The improve¬ 
ment in the eigensolver is a factor of 2.28 for MAGMA 
and 14.16 for LAPACK. The overall speedup factors are 
1.08 and 2.72 respectively. 


ing of degeneracies due to the Fermi-Dirac distribution 
of the orbital occupations. 


IV. PERFORMANCE AND PARALLELIZATION 

The PFSM method can use different types of eigen- 
solvers for the submatrices and RMG currently supports 
LAPACK and MAGMA. Improvements in execution time 
depend on problem size and type, and the architecture 
of the target platform. As the bulk of the execution time 
for the RMG code is normally spent in the multigrid 
preconditioner, which exhibits 0(N 2 ) scaling, and the 
eigensolver that exhibits 0(N 3 ) scaling, we expect that 
the overall speedup will increase as N increases and this 
is indeed observed. The results presented in the following 
tests use the Bluewaters system, a Cray XK7 located at 
NCSA, where each processing node consists of a multi¬ 
core Opteron CPU and a single Nvidia K20X GPU. The 
nodes are connected via a 3-D torus, with each link pro- 



MAGMA 

ScaLapack 

LAPACK 

MAGMA 

PFSM 

LAPACK 

PFSM 

Total time 

5.93 

6.52 

15.36 

5.46 

5.64 

Eigensolver 

1.23 

1.83 

10.34 

0.54 

0.73 


TABLE I. Time per SCF step (seconds) for 256 atom copper 
cell with 256 unoccupied orbitals, N=1664, using 64 Cray 
XK7 nodes. 

Results for the next system are presented in Table 
nn These results were generated for a 216 atom silicon 
vacancy with approximately 40 unoccupied states per 
atom. This provides a good example of the performance 
improvements possible for systems with a large number 
of unoccupied states, which are common in GW calcu¬ 
lations. In contrast to the previous result, where the 
MAGMA eigensolver was faster than ScaLapack, we now 
see better performance from ScaLapack. This is not sur¬ 
prising since the MAGMA result is being generated by a 
single GPU which is saturated for this value of N. The 
Eigensolver portion of the MAGMA-PFSM result is in 
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turn more than 5 times faster than the pure MAGMA 
result and 2.6 times faster than the ScaLapack result. 
We also note that the eigensolver results for LAPACK- 
PFSM are more than 20 times faster than the LAPACK 
result and indeed are nearly twice as fast as ScaLapack. 



MAGMA 

ScaLapack 

LAPACK 

MAGMA 

PFSM 

LAPACK 

PFSM 

Total time 

69.07 

43.39 

317.23 

25.30 

30.40 

Eigensolver 

52.04 

26.75 

296.67 

10.36 

14.23 


TABLE II. Time per SCF step (seconds) for 216 atom silicon 
vacancy 40 unoccupied orbitals/atom, N=9136, using 64 Cray 
XK7 nodes. 

The last test case presented in Table Hill is a 4000 atom 
aluminum cell running on 512 Cray XK7 nodes. The 
number of unoccupied states per atom is less than 1, 
which leads to a value of N that is only slightly larger 
than in the previous system, even though the cell vol¬ 
ume is 14.9 times larger. The Eigensolver portion of 
the ScaLapack calculation is 2.35 times faster than the 
MAGMA result, while the MAGMA-PFSM result is 6.15 
times faster than the pure MAGMA result. 



MAGMA 

ScaLapack 

LAPACK 

MAGMA 

PFSM 

LAPACK 

PFSM 

Total time 

64.75 

40.28 

351.16 

25.27 

35.78 

Eigensolver 

45.25 

19.27 

323.17 

7.35 

12.40 


TABLE III. Time per SCF step (seconds) for 4000 atom alu¬ 
minum supercell, N=9216, using 512 Cray XK7 nodes. 


V. IMPLEMENTATION 

The matrix elements of Eq© are generated as shown 
below. 


(ipi\H ks \ipj) = \i {i/jili/jj) i,j = l,N (12) 
This produces a generalized eigenvalue problem. 

Axi = XiBxi i = l,N (13) 

Which is normally solved using a standard LAPACK 
or ScaLapack routine such as DSYGVD or PDSYGVD. 
These routines convert the generalized form into the stan¬ 
dard form ofEq.© before solving for the eigenpairs. We 
use a similar approach in the PFSM but take advantage 
of specific characteristics of the problem to do the con¬ 
version efficiently on massively parallel computer archi¬ 
tectures. Conversion to standard form is equivalent to 
multiplying both sides of Ea. (fT3l) by B~ l . The explicit 
computation of B 1 is not required however and since 
B is diagonally dominant the identity serves as a good 


initial approximation, (B ^ I at convergence due to ul- 
trasoft pseudopotentials^). The initial guess may be im¬ 
proved using an iterative process. Let 

Z n+1 = {I - D 1 B)Z n + D x Ax (14) 

where D is the diagonal of B. Then at convergence 

Z = Z D X BZ + D~ x Ax (15) 

therefore BZ = Ax and Z = B X A. This iteration 
has the desirable property that the individual columns 
of Z can be computed independently. Given P computa¬ 
tional nodes, we assign N/P columns of Z to each node 
and achieve excellent scalability. After the completion of 
this step an MPI_Allgatherv call is used to ensure that 
all processing nodes have a copy of Z that is now used 
in Eq@. 

The matrix Z obtained in Eq. m is then decom¬ 
posed over processing nodes in the manner illustrated 
in Fig. (JT]). While the current implementation only uses 
MAGMA or LAPACK running on a single node to solve 
each submatrix, this is not an actual limitation of the 
method. Groups of processing nodes running ScaLapack 
or some other form of distributed eigensolver could also 
be used to solve the submatrices, work is under way to 
implement this. We expect that this will only be ad¬ 
vantageous when the value of N exceeds 15,000, because 
for submatrix sizes of M < 3000 (corresponding to full 
matrix sizes of 10000 to 15000), MAGMA running on a 
single node is the fastest option for computing the sub¬ 
matrix solutions on the Cray XK7. One consideration 
when using MAGMA is that the submatrix size is con¬ 
strained by the amount of GPU memory available on a 
single node. For the XK7 this is 6GB and a considerable 
fraction of that memory is reserved for other datastruc- 
tures. Consequently, the present RMG implementation is 
currently restricted to values of N < 10000. We expect to 
expand this to values of N ~ 35,000 for the XK7 via im¬ 
proved memory management. Future architectures, such 
as the Summit system scheduled to be installed at ORNL 
in 2017, will have a larger coherent memory space acces¬ 
sible to the GPU accelerators (up to 512 GBytes/node), 
and should be able to handle much larger systems. 

VI. SUMMARY 

We have developed a new method, referred to as 
PFSM, for computing the eigenpairs of matrices of the 
type typically found in ab-initio electronic structure cal¬ 
culations. PFMS exploits specific characteristics of the 
problem, and the available hardware architectures to 
achieve up to an order of magnitude improvement in 
the eigensolver execution time. The improved speed can 
be obtained with no sacrifice in accuracy or convergence 
rates, and is particularly well suited to metallic systems. 
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The source code for PFSM is currently available in the 
latest stable release of RMG, which may be obtained at 
http:://rmgdft.sourceforge.net. It is integrated into the 
main code base and work is underway to move it into 
a separate library with a calling interface similar to the 
standard eigensolvers. This should make it easy to incor¬ 
porate the method into other electronic structure codes 


with minimal work. 
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